[https://nvbugs/6330273][fix] Reserve KV cache slots for concurrent decode in V2 #15462
Kevin-Li-2025 wants to merge 3 commits into
Changes: Concurrent decode constraint for windowed KV cache pools
5755532 to 36634a6
@lowsfer could you review this?
0c8e844 to 75e8beb
/bot run --disable-fail-fast
PR_Github #55425 [ run ] triggered by Bot. Commit:
PR_Github #55425 [ run ] completed with state
Thanks for triggering CI. I cannot access the internal failed test details from the L0 report, but the public GitHub wrapper shows I pushed
This should avoid forcing impossible min-slot constraints in small-budget test configurations while preserving the intended fix for the high-concurrency windowed-pool deadlock case. I also added a focused unit check for the bounded case. Local checks I could run here:
I could not run the full target pytest locally because this checkout does not have the full TensorRT-LLM test dependency environment installed (
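The thread does not show how the min-slot floor is bounded for small budgets, so the following is only a sketch of one plausible shape: clamp the concurrent-decode floor to the number of decode requests the budget can actually hold. The function name `bounded_decode_floor` and the parameters `budget_bytes` and `slot_bytes_per_request` are hypothetical and are not taken from `kv_cache_manager_v2.py`.

```python
def bounded_decode_floor(max_batch_size: int,
                         budget_bytes: int,
                         slot_bytes_per_request: int) -> int:
    """Hypothetical helper: cap the concurrent-decode floor by the memory budget.

    slot_bytes_per_request is assumed to be the cost of one slot in every
    pool group, i.e. the memory a single decode request needs at minimum.
    """
    if slot_bytes_per_request <= 0:
        raise ValueError("slot_bytes_per_request must be positive")
    # How many single-block decode requests fit in the budget at all.
    affordable = budget_bytes // slot_bytes_per_request
    # Never ask for more than the budget allows, but keep at least one slot
    # so that a single request can always make progress.
    return max(1, min(max_batch_size, affordable))


# Large budget: the floor is max_batch_size.
roomy = bounded_decode_floor(max_batch_size=8, budget_bytes=1000, slot_bytes_per_request=100)
# Small budget: the floor is clamped to what fits.
tight = bounded_decode_floor(max_batch_size=64, budget_bytes=1000, slot_bytes_per_request=100)
# Degenerate budget: fall back to a single slot.
tiny = bounded_decode_floor(max_batch_size=64, budget_bytes=50, slot_bytes_per_request=100)
```

Under these assumptions the high-concurrency case keeps the full `max_batch_size` floor, while a small-budget test configuration gets a floor it can satisfy.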
ffd097c to 062bc37
Small correction: I amended the fix with DCO sign-off and force-pushed it as
Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
062bc37 to 207015d
I rebased the branch onto current Pushed new head: Local checks:
The public GitHub checks currently show DCO passing; NVIDIA internal L0 still needs to be re-triggered/shared by an NVIDIA maintainer if it remains blocked.
Description
Fixes #15401.
`KVCacheManagerV2` can under-reserve windowed pool slots when capacity planning only sees long-history requests. For small sliding windows, the stale range can leave a windowed pool with a min-slot floor of 1, which can deadlock scheduling once concurrent decode requests exceed that single slot.

This adds a generic concurrent-decode constraint to the V2 cache config: `max_batch_size` requests at one token block with `history_length=tokens_per_block - 1`. Each decode request needs one slot in every pool group, so this floors the min slots at `max_batch_size` without changing scheduler behavior.

The config also sets the existing StorageManager fallback typical-step explicitly, so adding the constraint does not accidentally switch ratio selection to constraint-only sizing.
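To make the floor arithmetic concrete, here is a minimal sketch of the reasoning above using a simplified toy model, not the actual `KVCacheManagerV2` code. The `Constraint` class and the `blocks_per_request` and `min_slots` helpers are hypothetical names, and the model assumes a pool only needs blocks for tokens that are still live (the last `window` tokens for a sliding-window pool), ignoring block-boundary straddling.

```python
from dataclasses import dataclass
from math import ceil
from typing import Iterable, Optional


@dataclass(frozen=True)
class Constraint:
    """Hypothetical planning constraint: num_requests requests, each holding
    history_length cached tokens plus one new token."""
    num_requests: int
    history_length: int


def blocks_per_request(history_length: int, tokens_per_block: int,
                       window: Optional[int]) -> int:
    # Full-attention pools keep every token; windowed pools keep only the
    # last `window` tokens (simplified: stale blocks are freed immediately).
    live_tokens = history_length + 1
    if window is not None:
        live_tokens = min(live_tokens, window)
    return ceil(live_tokens / tokens_per_block)


def min_slots(constraints: Iterable[Constraint], tokens_per_block: int,
              window: Optional[int]) -> int:
    # The pool must satisfy every constraint, so take the largest demand.
    return max(
        c.num_requests * blocks_per_request(c.history_length, tokens_per_block, window)
        for c in constraints
    )


tokens_per_block, window, max_batch_size = 32, 16, 8
long_history = Constraint(num_requests=1, history_length=4096)
# The added constraint: max_batch_size requests at one token block.
decode = Constraint(num_requests=max_batch_size, history_length=tokens_per_block - 1)

windowed_before = min_slots([long_history], tokens_per_block, window)
windowed_after = min_slots([long_history, decode], tokens_per_block, window)
full_after = min_slots([long_history, decode], tokens_per_block, None)
```

In this toy model the windowed pool's floor rises from 1 to `max_batch_size`, while the full-attention pool is still sized by the long-history request, which matches the intent of flooring min slots without changing scheduler behavior.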
Tests
- `python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py tests/unittest/_torch/executor/test_per_layer_head_dim.py`
- `git diff --check`

I attempted the targeted pytest, but local collection is blocked by a missing `nvtx` dependency in this checkout.

Summary by CodeRabbit
New Features
Tests